In this project, we developed a predictive model for the Health Insurance Marketplace in the US. We examined the transitions between 2014, 2015 and 2016, before and after the Affordable Care Act’s coverage-related provisions took effect in 2014. The ACA proved largely successful: close to 22.8 million people gained insurance coverage after the act, while 5.9 million people lost their coverage, for a net increase of 16.9 million people with insurance. Through in-depth research, trend analysis and exploratory analysis of the data released by the Centers for Medicare & Medicaid Services (CMS), comprising over twelve million records, we found a significant hike in the Maximum Out of Pocket (MOOP) costs that families bear, after which the issuer pays the rest. Considering the varying political ideologies across the country, we found that monthly premium rates are significantly lower in states with higher liberal scores than in conservative states. We also developed classification models using the C5.0 and rpart algorithms and evaluated their accuracies on held-out test sets. Finally, we trained a neural network model on the 2014-2016 data to predict plan rates for the year 2017.
Health insurance is a term commonly used to describe any program that helps pay medical expenses. A number of forms of health insurance exist in the United States; the main forms include varieties of private and public coverage. Insurance companies, or insurers, in the marketplace directly underwrite insurance policies relating to life, health, accident and medical risks, determining the premiums assigned to each plan. It is interesting to note that life and health insurers generate revenue not only through the specific activity of insurance underwriting, but also by investing premiums and annuity contracts. As a result, these insurers provide protection at a fraction of the potential loss by pooling various risks.
The number of Americans with private health insurance began to fall in the late 1990s and early 2000s, until the comprehensive health care reform law enacted in March 2010, commonly referred to as the Affordable Care Act (ACA) or “Obamacare”. Its primary goals were to make affordable health insurance available to more people, expand the Medicaid program to cover all adults with income below 138% of the federal poverty level, and support innovative medical care delivery methods designed to lower the cost of health care generally. The number of uninsured people fell significantly after the ACA was signed in 2010 but leveled off in 2016, which raises significant questions about the changes in the rules over the last three to four years. We therefore aim to investigate some of these questions by digging deep into the dataset provided by the Centers for Medicare & Medicaid Services (CMS).
How good a healthcare plan really is can be difficult to assess, but one very crude benchmark is the maximum out of pocket (MOOP). In layman’s terms, that is the maximum a subscriber has to pay if the absolute worst happens. Moreover, ‘healthcare.gov’ notes that not all states have expanded their Medicaid programs under the ACA. Hence, we also studied the impact and trends of monthly premiums and MOOP across US states. Furthermore, adults aged 35 to 64 are the primary market for life insurance. In the United States, an estimated 62.4% of total revenue for life insurance and annuities carriers is generated from individuals in this age group, with a similar trend expected throughout the rest of the world. Any increase in the number of individuals in this age bracket tends to increase revenue for the industry, and the global number of adults aged 35 to 64 is expected to increase in 2017. This trend raises several questions about unbalanced premium rate distribution by insurers, who leverage the age of individuals to mould their business models and plans to boost profits. Thus, in our project, we have endeavored to address such questions and possible shortcomings currently existing in the US health insurance marketplace.
The dataset used is the Health Insurance Marketplace dataset, which is compiled by the US government (Centers for Medicare & Medicaid Services) and provided as an open dataset. The dataset consists of 5 files, namely:
1. BusinessRules.csv: contains the basic rules of the health insurance business that must be followed and verified for a policy to be valid.
2. BenefitsCostSharing.csv: outlines the sharing of medical costs between the insurance company and the policy holder.
3. PlanAttributes.csv: contains the attributes and the benefits offered by the insurance plans. This file has the most sparsity.
4. Network.csv: links the issuer ID to the name of the company and the network they operate in.
5. Rate.csv: contains all the details regarding the rate of a policy based on different factors such as tobacco consumption, number of dependents, etc.
The dataset was very messy in its raw form and needed a lot of cleaning and imputation before it could be used. It had many missing values, ontological anomalies, extreme values (such as INT_MAX, 999999, zeroes), white space, interval versus single-value data, duplicated rows, and more.
Most of the categorical data was given as strings and had to be mapped to integer types before it could be used in neural networks for prediction.
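As a minimal sketch of this mapping (on a small hypothetical data frame, not the actual CMS files), string-coded categories can be converted to stable integer codes via factor() with explicit levels:

```r
# Hypothetical miniature of the rate data: string-coded categoricals.
df <- data.frame(
  Tobacco    = c("No Preference", "Tobacco User/Non-Tobacco User", "No Preference"),
  SourceName = c("HIOS", "SERFF", "OPM"),
  stringsAsFactors = FALSE
)

# factor() with explicit levels pins the code order; as.integer() yields 1, 2, ...
df$Tobacco    <- as.integer(factor(df$Tobacco,
                   levels = c("No Preference", "Tobacco User/Non-Tobacco User")))
df$SourceName <- as.integer(factor(df$SourceName,
                   levels = c("HIOS", "OPM", "SERFF")))
# df$Tobacco is now 1, 2, 1 and df$SourceName is 1, 3, 2
```

Fixing the levels explicitly matters here: it guarantees the same string always maps to the same integer across the 2014, 2015 and 2016 subsets fed to the network.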
The variables in the dataset are as follows:
Importing the data from "final_ratedata.csv" (derived from the overall dataset) and saving it into the R data frame named “rate”.
library(dplyr)
library(readr)
library(choroplethr)
library(choroplethrMaps)
library(ggplot2)
library(mapproj)
library(maps)
library(scales)
library(ggthemes)
library(magrittr)
library(reshape2)
library(arules)
library(arulesViz)
library(C50)
library(caret)
library(rpart)
library(Boruta)
library(neuralnet)
library(RColorBrewer)
library(pals)
library(gridExtra)
library(DT)
#Read in the rate.csv
ratedata=read.csv("Rate.csv", header = TRUE)
ratecopy=ratedata
new_df <- ratedata[-which(duplicated(ratedata)), ]
# Manually remove outliers
cleanrate=new_df[-which(new_df$IndividualRate==0),]
cleanrate=cleanrate[-which(cleanrate$IndividualRate==999999),]
cleanrate=cleanrate[-which(cleanrate$IndividualRate==9999.99),]
cleanrate=cleanrate[-which(cleanrate$IndividualRate==9999.00),]
cleanrate=cleanrate[-which(cleanrate$IndividualRate==0.01),]
boxplot(cleanrate$IndividualRate, main = "Boxplot: IndividualRate for non-tobacco users")
ratedata=cleanrate
# imputing values in rate columns (other than individual rate)
ratedata$IndividualTobaccoRate[is.na(ratedata$IndividualTobaccoRate)]=ratedata$IndividualRate[is.na(ratedata$IndividualTobaccoRate)]
ratedata$Couple[is.na(ratedata$Couple)]=ratedata$IndividualRate[is.na(ratedata$Couple)]
ratedata$PrimarySubscriberAndOneDependent[is.na(ratedata$PrimarySubscriberAndOneDependent)] =ratedata$IndividualRate[is.na(ratedata$PrimarySubscriberAndOneDependent)]
ratedata$PrimarySubscriberAndTwoDependents[is.na(ratedata$PrimarySubscriberAndTwoDependents)] =ratedata$IndividualRate[is.na(ratedata$PrimarySubscriberAndTwoDependents)]
ratedata$PrimarySubscriberAndThreeOrMoreDependents[is.na(ratedata$PrimarySubscriberAndThreeOrMoreDependents)]=ratedata$IndividualRate[is.na(ratedata$PrimarySubscriberAndThreeOrMoreDependents)]
ratedata$CoupleAndOneDependent[is.na(ratedata$CoupleAndOneDependent)]=ratedata$IndividualRate[is.na(ratedata$CoupleAndOneDependent)]
ratedata$CoupleAndTwoDependents[is.na(ratedata$CoupleAndTwoDependents)]=ratedata$IndividualRate[is.na(ratedata$CoupleAndTwoDependents)]
ratedata$CoupleAndThreeOrMoreDependents[is.na(ratedata$CoupleAndThreeOrMoreDependents)]=ratedata$IndividualRate[is.na(ratedata$CoupleAndThreeOrMoreDependents)]
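The eight assignments above could equivalently be written more compactly; a sketch using dplyr::coalesce (assuming ratedata with the same column names as above), where each family-rate column falls back to IndividualRate wherever it is NA:

```r
library(dplyr)

# Rate columns that should fall back to IndividualRate where missing
fam_cols <- c("IndividualTobaccoRate", "Couple",
              "PrimarySubscriberAndOneDependent",
              "PrimarySubscriberAndTwoDependents",
              "PrimarySubscriberAndThreeOrMoreDependents",
              "CoupleAndOneDependent", "CoupleAndTwoDependents",
              "CoupleAndThreeOrMoreDependents")

# coalesce() keeps the existing value and substitutes IndividualRate for NAs
ratedata <- ratedata %>%
  mutate(across(all_of(fam_cols), ~ coalesce(.x, IndividualRate)))
```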
#write.csv(ratedata,"final_ratedata.csv")
Read the new rate data file
rate <- ratedata
Extract data from the rate file individually for all three years: 2014, 2015 and 2016
rate_2014 <- rate[which(rate$BusinessYear=='2014' & rate$IndividualRate<9000),]
rate_2015 <- rate[which(rate$BusinessYear=='2015' & rate$IndividualRate<9000),]
rate_2016 <- rate[which(rate$BusinessYear=='2016' & rate$IndividualRate<9000),]
We explore the dataset as the first step to get an insight into how the variables relate to one another. We want to understand the distribution of each variable, and explore the trend of plan rates from 2014 to 2016. We also explore correlations between plan rates and the states in which plans are offered.
Prepare a dataset for year 2014 grouped by StateCode, calculating the number of unique plans in each state. We use group_by, rowwise and summarise to group the data, and repeat this analysis for each year.
stateplans_2014 <- rate_2014 %>%
group_by(StateCode) %>%
rowwise() %>%
summarise(NumberofPlans = n_distinct(PlanId),
value = n_distinct(PlanId),
StateCode = StateCode,
value = n_distinct(StateCode))
# Map each StateCode to a different fill color
ggplot(data=stateplans_2014, aes(x=StateCode, y=NumberofPlans, fill=StateCode)) +
geom_bar(stat="identity")
stateplans_2015 <- rate_2015 %>%
group_by(StateCode) %>%
rowwise() %>%
summarise(NumberofPlans = n_distinct(PlanId),
value = n_distinct(PlanId),
StateCode = StateCode,
value = n_distinct(StateCode))
ggplot(data=stateplans_2015, aes(x=StateCode, y=NumberofPlans, fill=StateCode)) +
geom_bar(stat="identity")
stateplans_2016 <- rate_2016 %>%
group_by(StateCode) %>%
rowwise() %>%
summarise(NumberofPlans = n_distinct(PlanId),
value = n_distinct(PlanId),
StateCode = StateCode,
value = n_distinct(StateCode))
ggplot(data=stateplans_2016, aes(x=StateCode, y=NumberofPlans, fill=StateCode)) +
geom_bar(stat="identity")
Create a map showing median plan rates in each state.
data(state.regions)
stateRates <- rate %>%
filter(IndividualRate<9000) %>%
group_by(StateCode) %>%
summarise(MeanIndividualRate = mean(IndividualRate),
value = median(IndividualRate)) %>%
inner_join(state.regions, by=c("StateCode"="abb"))
## Warning: Column `StateCode`/`abb` joining factor and character vector,
## coercing into character vector
state_choropleth(stateRates,
title="Median Health Insurance Plan Individual Rate ($ / month)",
num_colors=1)
## Warning in self$bind(): The following regions were missing and are being
## set to NA: minnesota, california, maryland, colorado, washington, vermont,
## kentucky, massachusetts, connecticut, new york, rhode island, district of
## columbia
The above map shows that states like Wisconsin, Wyoming, New Jersey and South Carolina have the highest median individual health premium rates.
Create a map showing the mean rate difference between tobacco users and non-users.
#install.packages("mapproj")
rate_ind_tobacco = subset(rate, (rate$Age != "Family Option" & rate$IndividualRate < 9000 & rate$BusinessYear == "2015" & rate$Tobacco != "No Preference"))
bystate <- rate_ind_tobacco %>%
select(StateCode, IndividualRate, IndividualTobaccoRate)
bystatecount<-bystate %>%
group_by(StateCode) %>%
summarize(mean_non_Tobacco_Rate = mean(IndividualRate),
mean_Tobacco_Rate = mean(IndividualTobaccoRate),
mean_Diff= (mean_Tobacco_Rate - mean_non_Tobacco_Rate)) %>%
arrange(desc(mean_Diff))
bystatecount$region <- tolower(state.name[match(bystatecount$StateCode,state.abb)])
us_state_map = map_data('state')
statename<-group_by(us_state_map, region) %>%
summarise(long = mean(long), lat = mean(lat))
mapdata <- left_join(bystatecount, us_state_map, by="region")
p <- ggplot()+geom_polygon(data=mapdata, aes(x=long, y=lat, group = group, fill=mean_Diff),colour="white")+
scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar")
p1 <- p+theme_bw()+
labs(fill = "Mean Difference $/mon",
title = "Mean Rate Difference Between Tobacco Users and Non-Tobacco Users", x="", y="")
p2<-p1 + scale_y_continuous(breaks=c()) + scale_x_continuous(breaks=c()) + theme(panel.border = element_blank())
p3<-p2+geom_text(data = statename, aes(x=long, y=lat, label=region), na.rm = T, size=2)+
coord_map()
p3
The above plot shows that the greatest difference in mean rates between tobacco users and non-users exists in the state of Wyoming. This means there is a large difference in the monthly premium rate depending on whether the user opts for a tobacco plan.
Prepare the data for further analysis. Here we explore the rates for the year 2014
rate_nofam = subset(rate, (rate$Age != "Family Option" & rate$BusinessYear == "2014"))
rate_fam = subset(rate, (rate$Age == "Family Option" & rate$BusinessYear == "2014"))
rate_final_2015 = subset(rate, BusinessYear == "2015")
rate_final_2014 = subset(rate, BusinessYear == "2014")
IndividualOption <- subset(rate_final_2015, (rate_final_2015$Age != "Family Option" & rate_final_2015$IndividualRate < 9000),
select = c(BusinessYear:IndividualTobaccoRate))
The following analysis shows the relationship between plan rates and their carriers, grouped by state.
bystate <- IndividualOption %>%
select(StateCode, IssuerId, PlanId, IndividualRate)
bystatecount<-bystate %>%
group_by(StateCode) %>%
summarize(Carriers = length(unique(IssuerId)),
PlanOffered = length(unique(PlanId)),
MeanIndRate= mean(IndividualRate),
MedianIndRate = median(IndividualRate)) %>%
arrange(desc(PlanOffered))
head(bystatecount)
## # A tibble: 6 x 5
## StateCode Carriers PlanOffered MeanIndRate MedianIndRate
## <fct> <int> <int> <dbl> <dbl>
## 1 WI 32 834 467. 415.
## 2 TX 33 716 262. 233.
## 3 FL 29 621 308. 310.
## 4 OH 33 538 256. 39.7
## 5 PA 32 538 328. 302.
## 6 IL 21 466 391. 352.
##
# Graph 1 - Carriers vs. Plans Available By State
bystatecount$Carriers <- ifelse(bystatecount$Carriers<15, "(0,15)", ifelse(bystatecount$Carriers>=25, "[25,35)","[15,25)"))
a <- ggplot(bystatecount, aes(x=reorder(StateCode, PlanOffered), y=PlanOffered), height = 40, width = 40)+
geom_bar(aes(fill = Carriers), stat="identity")+coord_flip()+
ggtitle("Carriers vs. Plans Available By State")+
labs(x="State", y="Plans Available")
a
In this analysis, we explore the relationship between individual plan rates and age.
#install.packages("magrittr")
rate_summary_individual_age <- rate_nofam %>%
group_by(Age) %>%
summarize(Number_of_Providers = length(unique(IssuerId)),
Number_of_Plans_Provided = length(unique(PlanId)),
Average_Individual_Rate = mean(IndividualRate),
MedianIndRate = median(IndividualRate),
Standard_deviation_Individual_Rate= sd(IndividualRate),
Average_Individual_Rate_Tobacco = mean(IndividualTobaccoRate,na.rm=TRUE),
Per_state_Population=length(StateCode)) %>%
arrange(desc(Per_state_Population))
rate_summary_family<-rate_fam %>%
group_by(Age) %>%
summarize(Number_of_Providers = length(unique(IssuerId)),
Number_of_Plans_Provided = length(unique(PlanId)),
Average_Individual_Rate = mean(IndividualRate),
MedianIndRate = median(IndividualRate),
Standard_deviation_Individual_Rate= sd(IndividualRate),
Average_Individual_Rate_Tobacco = mean(IndividualTobaccoRate,na.rm=TRUE),
Per_state_Population=length(StateCode)) %>%
arrange(desc(Per_state_Population))
rate_summary_plan<-rate_nofam %>%
group_by(IssuerId) %>%
summarize(
Number_of_Plans_Provided = length(unique(PlanId)),
Average_Individual_Rate = mean(IndividualRate),
Average_fam2_Rate = mean(CoupleAndOneDependent),
MedianIndRate = median(IndividualRate),
Standard_deviation_Individual_Rate = sd(IndividualRate),
Average_Individual_Rate_Tobacco = mean(IndividualTobaccoRate,na.rm=TRUE),
Per_state_Population=length(StateCode)) %>%
arrange(desc(Per_state_Population))
ggplot(rate_summary_individual_age, aes(x=Age, y=Average_Individual_Rate))+
geom_bar(aes(fill = Age), stat="identity")+coord_flip()+
ggtitle("Individual Rate per Age")+
labs(x="Age", y="Average Rate")
In general, individual premium plan rates trend upward as the age of the person increases, which is quite expected.
Here we wish to explore the relationship between family plan rates and the states in which they are offered.
family_rate <- subset(rate, (rate$Age == "Family Option"))
Family_summary<-family_rate %>%
group_by(StateCode) %>%
summarize(Number_of_Providers = length(unique(IssuerId)),
Number_of_Plans_Provided = length(unique(PlanId)),
Average_Individual_Rate = mean(IndividualRate),
MedianIndRate = median(IndividualRate),
Standard_deviation_Individual_Rate= sd(IndividualRate),
Average_Couple_Rate = mean(Couple),
MedianInd_Couple_Rate = median(Couple),
Standard_deviation_Couple_Rate= sd(Couple),
Average_PrimarySubscriberAndOneDependent = mean(PrimarySubscriberAndOneDependent),
MedianInd_PrimarySubscriberAndOneDependent = median(PrimarySubscriberAndOneDependent),
Standard_deviation_PrimarySubscriberAndOneDependent= sd(PrimarySubscriberAndOneDependent),
Average_PrimarySubscriberAndTwoDependents = mean(PrimarySubscriberAndTwoDependents),
MedianInd_PrimarySubscriberAndTwoDependent = median(PrimarySubscriberAndTwoDependents),
Standard_deviation_PrimarySubscriberAndTwoDependent= sd(PrimarySubscriberAndTwoDependents),
Average_PrimarySubscriberAndThreeOrMoreDependents = mean(PrimarySubscriberAndThreeOrMoreDependents),
MedianInd_PrimarySubscriberAndThreeOrMoreDependents = median(PrimarySubscriberAndThreeOrMoreDependents),
Standard_deviation_PrimarySubscriberAndThreeOrMoreDependents= sd(PrimarySubscriberAndThreeOrMoreDependents),
Average_CoupleAndOneDependent= mean(CoupleAndOneDependent),
MedianInd_CoupleAndOneDependent = median(CoupleAndOneDependent),
Standard_deviation_CoupleAndOneDependent= sd(CoupleAndOneDependent),
Average_CoupleAndTwoDependents = mean(CoupleAndTwoDependents),
MedianInd_CoupleAndTwoDependents = median(CoupleAndTwoDependents),
Standard_deviation_CoupleAndTwoDependents= sd(CoupleAndTwoDependents),
Average_CoupleAndThreeOrMoreDependents = mean(CoupleAndThreeOrMoreDependents),
MedianInd_CoupleAndThreeOrMoreDependents = median(CoupleAndThreeOrMoreDependents),
Standard_deviation_CoupleAndThreeOrMoreDependents= sd(CoupleAndThreeOrMoreDependents)
) %>%
arrange(desc(Number_of_Providers))
temp1<-family_rate %>%
group_by(StateCode) %>%
summarize(Average_Individual_Rate = mean(IndividualRate),
Average_Couple_Rate = mean(Couple),
Average_PrimarySubscriberAndOneDependent = mean(PrimarySubscriberAndOneDependent),
Average_PrimarySubscriberAndTwoDependents = mean(PrimarySubscriberAndTwoDependents),
Average_PrimarySubscriberAndThreeOrMoreDependents = mean(PrimarySubscriberAndThreeOrMoreDependents),
Average_CoupleAndTwoDependents = mean(CoupleAndTwoDependents),
Average_CoupleAndThreeOrMoreDependents = mean(CoupleAndThreeOrMoreDependents)
) %>%
arrange(desc(Average_Individual_Rate))
f1 <- ggplot(Family_summary, aes(x=StateCode, y=Average_Couple_Rate))+
geom_bar(aes(fill = StateCode), stat="identity")+
ggtitle("Average Couple Rate per State")+
labs(x="State", y="Rates")
f2 <- ggplot(Family_summary, aes(x=StateCode, y=Average_PrimarySubscriberAndOneDependent))+
geom_bar(aes(fill = StateCode), stat="identity")+
ggtitle("Average Primary +1 dependent Rate per State")+
labs(x="State", y="Rates")
f3 <- ggplot(Family_summary, aes(x=StateCode, y=Average_PrimarySubscriberAndTwoDependents))+
geom_bar(aes(fill = StateCode), stat="identity")+
ggtitle("Average Primary + 2 dependent Rate per State")+
labs(x="State", y="Rates")
f4 <- ggplot(Family_summary, aes(x=StateCode, y=Average_PrimarySubscriberAndThreeOrMoreDependents))+
geom_bar(aes(fill = StateCode), stat="identity")+
ggtitle("Average Primary + 3 or more dependent Rate per State")+
labs(x="State", y="Rates")
f5 <- ggplot(Family_summary, aes(x=StateCode, y=Average_CoupleAndOneDependent))+
geom_bar(aes(fill = StateCode), stat="identity")+
ggtitle("Average Couple Rate +1 dependent per State")+
labs(x="State", y="Rates")
f6 <- ggplot(Family_summary, aes(x=StateCode, y=Average_CoupleAndTwoDependents))+
geom_bar(aes(fill = StateCode), stat="identity")+
ggtitle("Average Couple Rate +2 dependent per State")+
labs(x="State", y="Rates")
f7 <- ggplot(Family_summary, aes(x=StateCode, y=Average_CoupleAndThreeOrMoreDependents))+
geom_bar(aes(fill = StateCode), stat="identity")+
ggtitle("Average Couple Rate +3 or more dependent per State")+
labs(x="State", y="Rates")
temp<- melt(temp1,id.var="StateCode")
ggplot(temp, aes(x = StateCode, y = value, fill = variable)) +
geom_bar(stat = "identity")+coord_flip()+
ggtitle("Average Rates per category for States")+
labs(x="State", y="Rates")
The above plot shows the distribution of average plan rates for each category of plan (individual, couple, primary subscriber and one dependent, etc.) per state.
Here, we aim to explore the associations between different rules while considering what benefits are associated with which plan. These associations help us relate which rules are the most important ones and appear in most plans.
b_rules <- read.csv("BusinessRules.csv", na.strings=c("","NA"))
bru<- subset( b_rules, select = -c(BusinessYear, ProductId, StandardComponentId, ImportDate, IssuerId2,VersionNum,TIN, RowNumber, IssuerId ) )
bru <- bru[which(bru$SourceName != "HIOS"), ]
Measuring rule importance using support and confidence
Support and confidence are the two criteria to help us decide whether a pattern is “interesting”. By setting thresholds for these two criteria, we can easily limit the number of interesting rules or item-sets reported.
For item-sets \(X\) and \(Y\), the support of an item-set measures how frequently it appears in the data: \[support(X)=\frac{count(X)}{N},\] where N is the total number of transactions in the database and count(X) is the number of observations (transactions) containing the item-set X. Of course, the union of item-sets is an item-set itself, i.e., if \(Z={X,Y}\), then \[support(Z)=support(X,Y).\]
For a rule \(X \rightarrow Y\), the rule's confidence measures the relative accuracy of the rule: \[confidence(X \rightarrow Y)=\frac{support(X, Y)}{support(X)}\]
This measures the joint occurrence of X and Y over the X domain. If whenever X appears Y tends to be present too, we will have a high \(confidence(X\rightarrow Y)\). The ranges of the support and confidence are \(0 \leq support,\ confidence \leq 1\).
Lift measures the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response relative to the population as a whole, measured against a random-choice targeting model: \[lift(X\rightarrow Y)=\frac{confidence(X\rightarrow Y)}{support(Y)}\]
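These definitions can be checked by hand on a toy transaction set (hypothetical items A, B, C, unrelated to the CMS data) before applying them to the business rules:

```r
library(arules)

# Five toy transactions
trans <- as(list(c("A", "B"), c("A", "B", "C"), c("A"),
                 c("B", "C"), c("A", "B")), "transactions")

# By hand: support(A) = 4/5 = 0.8, support(B) = 4/5 = 0.8, support(A,B) = 3/5 = 0.6
# confidence(A -> B) = support(A,B) / support(A) = 0.6 / 0.8 = 0.75
# lift(A -> B) = confidence(A -> B) / support(B) = 0.75 / 0.8 = 0.9375
rules <- apriori(trans,
                 parameter = list(supp = 0.5, conf = 0.5, minlen = 2))
inspect(rules)  # reports the same support/confidence/lift values
```

With these thresholds only {A} => {B} and {B} => {A} survive; a lift below 1 indicates that A and B co-occur slightly less often than independence would predict.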
#apriori
b_rules1 <- apriori(bru, parameter = list(minlen=2, supp=0.7, conf=0.7))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.7 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 5835
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[304 item(s), 8336 transaction(s)] done [0.00s].
## sorting and recoding items ... [8 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [302 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
b_set <- sort(b_rules1, by="lift")
We use the arulesViz package to visualize scatter plots of the mined rules.
plot(b_set)
plot(head(sort(b_set, by = "lift"), n=50), method = "graph", control=list(cex=.5))
The plot above represents the association rules as a graph network, showing only the first 50 associations.
inspect(b_set[1:3])
## lhs rhs support confidence lift count
## [1] {SourceName=SERFF,
## SameSexPartnerAsSpouseIndicator=Yes} => {DomesticPartnerAsSpouseIndicator=Yes} 0.7136516 0.9815212 1.296460 5949
## [2] {SourceName=SERFF,
## SameSexPartnerAsSpouseIndicator=Yes,
## AgeDeterminationRule=Age on effective date} => {DomesticPartnerAsSpouseIndicator=Yes} 0.7048944 0.9812959 1.296163 5876
## [3] {DomesticPartnerAsSpouseIndicator=Yes} => {SameSexPartnerAsSpouseIndicator=Yes} 0.7521593 0.9935034 1.296063 6270
The top association rule is {SourceName=SERFF, SameSexPartnerAsSpouseIndicator=Yes} => {DomesticPartnerAsSpouseIndicator=Yes}, with support 0.7136516, confidence 0.9815212 and lift 1.296460.
Since we do not have predefined rules for classifying the dataset, we can use decision trees on the data. We compare the performance of the rpart and C5.0 algorithms by validating the models on test data (20%).
set.seed(1234)
train_index <- sample(seq_len(nrow(bru)), size = 0.8*nrow(bru))
bru_train<-bru[train_index, ]
bru_test <- bru[-train_index, ]
Create the C5.0 decision tree model and predict
C50_dec_tree <- C5.0(DentalOnlyPlan ~ SourceName+
AgeDeterminationRule +EnrolleeContractRateDeterminationRule+ ChildrenOnlyContractMaxChildrenRule+MinimumTobaccoFreeMonthsRule+
TwoParentFamilyMaxDependentsRule + DomesticPartnerAsSpouseIndicator + SingleParentFamilyMaxDependentsRule + SameSexPartnerAsSpouseIndicator + MarketCoverage, data = bru_train, na.action = na.omit)
summary(C50_dec_tree)
##
## Call:
## C5.0.formula(formula = DentalOnlyPlan ~ SourceName +
## + SingleParentFamilyMaxDependentsRule + SameSexPartnerAsSpouseIndicator
## + MarketCoverage, data = bru_train, na.action = na.omit)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Apr 21 19:05:32 2018
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 6356 cases (11 attributes) from undefined.data
##
## Decision tree:
##
## MinimumTobaccoFreeMonthsRule = 60: No (0)
## MinimumTobaccoFreeMonthsRule in {12,3,4,6,999}:
## :...MarketCoverage = Individual: No (2841/39)
## : MarketCoverage = SHOP (Small Group):
## : :...ChildrenOnlyContractMaxChildrenRule = 3 or more: No (401/22)
## : ChildrenOnlyContractMaxChildrenRule = 1:
## : :...DomesticPartnerAsSpouseIndicator = No: No (21)
## : DomesticPartnerAsSpouseIndicator = Yes: Yes (94/37)
## MinimumTobaccoFreeMonthsRule = Not Applicable:
## :...SourceName = HIOS: Yes (0)
## SourceName = OPM: No (35)
## SourceName = SERFF: [S1]
##
## SubTree [S1]
##
## EnrolleeContractRateDeterminationRule = There are rates specifically for couples and for families (not just addition of individual rates): Yes (462)
## EnrolleeContractRateDeterminationRule = A different rate (specifically for parties of two or more)for each enrollee is added together:
## :...ChildrenOnlyContractMaxChildrenRule = 3 or more: Yes (2288/551)
## ChildrenOnlyContractMaxChildrenRule = 1:
## :...TwoParentFamilyMaxDependentsRule in {1,2}: Yes (15)
## TwoParentFamilyMaxDependentsRule = 3 or more: No (199/63)
##
##
## Evaluation on training data (6356 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 9 712(11.2%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 3373 588 (a): class No
## 124 2271 (b): class Yes
##
##
## Attribute usage:
##
## 100.00% MinimumTobaccoFreeMonthsRule
## 52.82% MarketCoverage
## 47.48% ChildrenOnlyContractMaxChildrenRule
## 47.18% SourceName
## 46.63% EnrolleeContractRateDeterminationRule
## 3.37% TwoParentFamilyMaxDependentsRule
## 1.81% DomesticPartnerAsSpouseIndicator
##
##
## Time: 0.0 secs
# See docs for predict # ?C50::predict.C5.0
c50_pred<-predict(C50_dec_tree, bru_test)
confusionMatrix(table(c50_pred, bru_test$DentalOnlyPlan))
## Confusion Matrix and Statistics
##
##
## c50_pred No Yes
## No 882 40
## Yes 152 594
##
## Accuracy : 0.8849
## 95% CI : (0.8686, 0.8998)
## No Information Rate : 0.6199
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7638
## Mcnemar's Test P-Value : 1.14e-15
##
## Sensitivity : 0.8530
## Specificity : 0.9369
## Pos Pred Value : 0.9566
## Neg Pred Value : 0.7962
## Prevalence : 0.6199
## Detection Rate : 0.5288
## Detection Prevalence : 0.5528
## Balanced Accuracy : 0.8950
##
## 'Positive' Class : No
##
We see that the accuracy of the C5.0 model is 0.8849 and the kappa value is 0.7638.
set.seed(123)
bru_rpart_model<-rpart(DentalOnlyPlan~., data=bru_train, cp=0.01)
# here we use rpart::cp = *complexity parameter* = 0.01
Create the rpart decision tree model and predict
bru_rpart_pred<-predict(bru_rpart_model, bru_test,type = 'class')
confusionMatrix(table(bru_rpart_pred, bru_test$DentalOnlyPlan))
## Confusion Matrix and Statistics
##
##
## bru_rpart_pred No Yes
## No 987 39
## Yes 47 595
##
## Accuracy : 0.9484
## 95% CI : (0.9367, 0.9586)
## No Information Rate : 0.6199
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8909
## Mcnemar's Test P-Value : 0.4504
##
## Sensitivity : 0.9545
## Specificity : 0.9385
## Pos Pred Value : 0.9620
## Neg Pred Value : 0.9268
## Prevalence : 0.6199
## Detection Rate : 0.5917
## Detection Prevalence : 0.6151
## Balanced Accuracy : 0.9465
##
## 'Positive' Class : No
##
We see that the accuracy of the rpart model is 0.9484 and the kappa value is 0.8909. We conclude that rpart performs better than the C5.0 model for classifying business rules.
To find the best complexity parameter, we prune the tree at the optimal cp, the complexity parameter to which the rpart object will be trimmed. Instead of using the resubstitution error (e.g., \(1-R^2\), RMSE) to capture the discrepancy between the observed labels and the model-predicted labels, we use xerror, which averages the discrepancy between observed and predicted classifications using cross-validation.
set.seed(12345)
control = rpart.control(cp = 0.000, xval = 100, minsplit = 2)
bru_rpart_ctrl_model= rpart(DentalOnlyPlan ~ ., data = bru_train, control = control)
plotcp(bru_rpart_ctrl_model)
printcp(bru_rpart_ctrl_model)
##
## Classification tree:
## rpart(formula = DentalOnlyPlan ~ ., data = bru_train, control = control)
##
## Variables actually used in tree construction:
## [1] CohabitationRule
## [2] DependentMaximumAgRule
## [3] DomesticPartnerAsSpouseIndicator
## [4] EnrolleeContractRateDeterminationRule
## [5] MarketCoverage
## [6] MinimumTobaccoFreeMonthsRule
## [7] SourceName
## [8] StateCode
##
## Root node error: 2657/6668 = 0.39847
##
## n= 6668
##
## CP nsplit rel error xerror xstd
## 1 0.74858863 0 1.000000 1.000000 0.0150464
## 2 0.05005645 1 0.251411 0.255928 0.0093005
## 3 0.03274370 2 0.201355 0.205871 0.0084336
## 4 0.02070004 3 0.168611 0.171622 0.0077573
## 5 0.01561912 4 0.147911 0.153933 0.0073744
## 6 0.00940911 7 0.094091 0.103500 0.0061112
## 7 0.00301091 8 0.084682 0.096726 0.0059162
## 8 0.00225819 10 0.078660 0.094467 0.0058494
## 9 0.00213273 11 0.076402 0.088822 0.0056786
## 10 0.00188182 14 0.070004 0.086187 0.0055968
## 11 0.00112909 16 0.066240 0.082047 0.0054654
## 12 0.00100364 18 0.063982 0.080542 0.0054167
## 13 0.00075273 21 0.060971 0.079413 0.0053798
## 14 0.00065864 23 0.059466 0.075273 0.0052422
## 15 0.00056455 27 0.056831 0.075273 0.0052422
## 16 0.00037636 29 0.055702 0.071133 0.0051003
## 17 0.00031364 34 0.053820 0.068122 0.0049943
## 18 0.00025091 40 0.051938 0.068122 0.0049943
## 19 0.00018818 43 0.051186 0.067746 0.0049808
## 20 0.00000000 47 0.050433 0.064358 0.0048581
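Reading the table above, pruning picks the row with the smallest cross-validated error (xerror) and uses its CP, mirroring `which.min(cptable[,"xerror"])`. A small Python sketch of that selection, using a few of the printed rows as assumed inputs:

```python
# Pick the complexity parameter (CP) whose row has the minimal
# cross-validated error (xerror). The rows below are a subset of
# the printed cptable, used here as assumptions.
cptable = [
    {"CP": 0.74858863, "xerror": 1.000000},
    {"CP": 0.00018818, "xerror": 0.067746},
    {"CP": 0.00000000, "xerror": 0.064358},
]

# The row with minimal xerror supplies the CP used for pruning.
best = min(cptable, key=lambda row: row["xerror"])
print(best["CP"])  # 0.0
```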
set.seed(1234)
selected_tr <- prune(bru_rpart_ctrl_model, cp= bru_rpart_ctrl_model$cptable[which.min(bru_rpart_ctrl_model$cptable[,"xerror"]),"CP"])
bru_pred_tune<-predict(selected_tr, bru_test,type = 'class')
confusionMatrix(table(bru_pred_tune, bru_test$DentalOnlyPlan))
## Confusion Matrix and Statistics
##
##
## bru_pred_tune No Yes
## No 1008 28
## Yes 26 606
##
## Accuracy : 0.9676
## 95% CI : (0.958, 0.9756)
## No Information Rate : 0.6199
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9313
## Mcnemar's Test P-Value : 0.8918
##
## Sensitivity : 0.9749
## Specificity : 0.9558
## Pos Pred Value : 0.9730
## Neg Pred Value : 0.9589
## Prevalence : 0.6199
## Detection Rate : 0.6043
## Detection Prevalence : 0.6211
## Balanced Accuracy : 0.9653
##
## 'Positive' Class : No
##
After pruning, the accuracy of the rpart model increased to 0.9676 and the kappa value to 0.9313. We are able to create a fairly accurate classification model with the rpart algorithm.
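As a sanity check, the reported accuracy and kappa can be recomputed directly from the confusion matrix above; here is a small Python sketch using the printed cell counts:

```python
# Confusion matrix cells from the printed output:
# rows = predicted (No, Yes), columns = actual (No, Yes)
tn, fn = 1008, 28   # predicted No:  actual No, actual Yes
fp, tp = 26, 606    # predicted Yes: actual No, actual Yes
n = tn + fn + fp + tp

accuracy = (tn + tp) / n

# Cohen's kappa: observed agreement corrected for agreement
# expected by chance, given the row and column totals.
p_o = accuracy
p_e = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n**2
kappa = (p_o - p_e) / (1 - p_e)

print(round(accuracy, 4), round(kappa, 4))  # 0.9676 0.9313
```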
Convert Age to a factor variable and SourceName to a numeric factor variable. Also, create a new column, date_diff, as the difference between the rate expiration date and the rate effective date.
#create smaller data set of 80000 samples
smalldata <- rate[sample(nrow(rate),80000),]
#preprocessing of smalldata
smalldata$RatingAreaId= gsub("Rating Area ", "", smalldata$RatingAreaId)
smalldata$Tobacco= gsub("No Preference", "0", smalldata$Tobacco)
smalldata$Tobacco= gsub("Tobacco User/Non-Tobacco User", "1", smalldata$Tobacco)
smalldata$Age= gsub("65 and over", "65", smalldata$Age)
smalldata$Age= gsub("0-20", "10", smalldata$Age)
smalldata$Age= gsub("Family Option", "0", smalldata$Age)
smalldata$SourceName= gsub("HIOS", "0", smalldata$SourceName)
smalldata$SourceName= gsub("OPM", "1", smalldata$SourceName)
smalldata$SourceName= gsub("SERFF", "2", smalldata$SourceName)
smalldata$date_diff <- as.Date(as.character(smalldata$RateExpirationDate), format="%Y-%m-%d")-
as.Date(as.character(smalldata$RateEffectiveDate), format="%Y-%m-%d")
smalldata$FederalTIN= gsub("-", "", smalldata$FederalTIN)
colnames(smalldata)
## [1] "BusinessYear"
## [2] "StateCode"
## [3] "IssuerId"
## [4] "SourceName"
## [5] "VersionNum"
## [6] "ImportDate"
## [7] "IssuerId2"
## [8] "FederalTIN"
## [9] "RateEffectiveDate"
## [10] "RateExpirationDate"
## [11] "PlanId"
## [12] "RatingAreaId"
## [13] "Tobacco"
## [14] "Age"
## [15] "IndividualRate"
## [16] "IndividualTobaccoRate"
## [17] "Couple"
## [18] "PrimarySubscriberAndOneDependent"
## [19] "PrimarySubscriberAndTwoDependents"
## [20] "PrimarySubscriberAndThreeOrMoreDependents"
## [21] "CoupleAndOneDependent"
## [22] "CoupleAndTwoDependents"
## [23] "CoupleAndThreeOrMoreDependents"
## [24] "RowNumber"
## [25] "date_diff"
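The date_diff column created above is just the number of days between the two dates. For illustration, the same computation in Python (the dates shown are hypothetical):

```python
from datetime import date

def date_diff(expiration: str, effective: str) -> int:
    """Days between a rate's expiration and effective dates (ISO strings),
    as in the as.Date(...) - as.Date(...) step above."""
    return (date.fromisoformat(expiration) - date.fromisoformat(effective)).days

# Hypothetical plan valid for the 2014 calendar year:
print(date_diff("2014-12-31", "2014-01-01"))  # 364
```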
data_for_boruta <-smalldata[ , -which(names(smalldata) %in% c("BusinessYear","ImportDate", "IssuerId2","FederalTIN","RateEffectiveDate","RateExpirationDate", "PlanId","IndividualTobaccoRate", "PrimarySubscriberAndOneDependent","PrimarySubscriberAndTwoDependents","PrimarySubscriberAndThreeOrMoreDependents","CoupleAndOneDependent","CoupleAndTwoDependents","CoupleAndThreeOrMoreDependents","RowNumber","date_diff", "Couple"))]
str(data_for_boruta)
## 'data.frame': 80000 obs. of 8 variables:
## $ StateCode : Factor w/ 39 levels "AK","AL","AR",..: 20 27 19 27 34 27 4 11 38 34 ...
## $ IssuerId : int 11512 28162 23603 28162 61315 86728 51485 36096 31274 87226 ...
## $ SourceName : chr "0" "2" "2" "2" ...
## $ VersionNum : int 9 4 7 4 5 4 13 10 10 6 ...
## $ RatingAreaId : chr "14" "12" "3" "14" ...
## $ Tobacco : chr "1" "0" "0" "0" ...
## $ Age : chr "44" "22" "59" "65" ...
## $ IndividualRate: num 321.7 424 46.1 1342.6 30.1 ...
Begin feature selection
set.seed(123)
ppmi_boruta<-Boruta(IndividualRate~., data=data_for_boruta, doTrace=0)
## Growing trees.. Progress: 64%. Estimated remaining time: 17 seconds.
## Computing permutation importance.. Progress: 59%. Estimated remaining time: 21 seconds.
## Growing trees.. Progress: 65%. Estimated remaining time: 16 seconds.
## Computing permutation importance.. Progress: 59%. Estimated remaining time: 21 seconds.
## Growing trees.. Progress: 70%. Estimated remaining time: 13 seconds.
## Computing permutation importance.. Progress: 54%. Estimated remaining time: 26 seconds.
## Growing trees.. Progress: 70%. Estimated remaining time: 13 seconds.
## Computing permutation importance.. Progress: 60%. Estimated remaining time: 20 seconds.
## Growing trees.. Progress: 69%. Estimated remaining time: 13 seconds.
## Computing permutation importance.. Progress: 52%. Estimated remaining time: 28 seconds.
## Growing trees.. Progress: 60%. Estimated remaining time: 20 seconds.
## Computing permutation importance.. Progress: 55%. Estimated remaining time: 25 seconds.
## Growing trees.. Progress: 62%. Estimated remaining time: 19 seconds.
## Computing permutation importance.. Progress: 55%. Estimated remaining time: 25 seconds.
## Growing trees.. Progress: 62%. Estimated remaining time: 19 seconds.
## Computing permutation importance.. Progress: 56%. Estimated remaining time: 24 seconds.
## Growing trees.. Progress: 63%. Estimated remaining time: 18 seconds.
## Computing permutation importance.. Progress: 54%. Estimated remaining time: 25 seconds.
## Growing trees.. Progress: 64%. Estimated remaining time: 17 seconds.
## Computing permutation importance.. Progress: 58%. Estimated remaining time: 22 seconds.
print(ppmi_boruta)
## Boruta performed 10 iterations in 18.04925 mins.
## 7 attributes confirmed important: Age, IssuerId, RatingAreaId,
## SourceName, StateCode and 2 more;
## No attributes deemed unimportant.
ppmi_boruta$ImpHistory[1:6, 1:10]
## StateCode IssuerId SourceName VersionNum RatingAreaId Tobacco
## [1,] 116.4772 118.9919 62.05204 117.0608 40.31077 206.1937
## [2,] 124.0483 120.0399 62.95507 118.8708 42.65647 201.1707
## [3,] 124.8317 124.5641 62.66332 117.2797 41.86787 200.3221
## [4,] 121.8676 119.4734 65.23306 121.1380 40.70576 191.2189
## [5,] 122.5822 121.4917 66.88530 122.7633 43.36275 204.0690
## [6,] 123.0402 127.2608 63.23441 119.0927 42.58185 205.3681
## Age shadowMax shadowMean shadowMin
## [1,] 394.7237 2.3047120 0.6090962 -1.960094
## [2,] 404.0171 1.0706027 -1.2910484 -3.039002
## [3,] 412.0495 0.1303121 -0.5714952 -1.742969
## [4,] 385.2483 3.2791698 1.1732179 -0.619702
## [5,] 377.3597 1.5825932 -0.2637642 -3.072958
## [6,] 409.9142 3.1580682 0.8119972 -1.209018
Plot the feature importance box plot as computed from Boruta above.
plot(ppmi_boruta, xaxt="n", xlab="")
lz<-lapply(1:ncol(ppmi_boruta$ImpHistory), function(i)
ppmi_boruta$ImpHistory[is.finite(ppmi_boruta$ImpHistory[, i]), i])
names(lz)<-colnames(ppmi_boruta$ImpHistory)
lb<-sort(sapply(lz, median))
axis(side=1, las=2, labels=names(lb), at=1:ncol(ppmi_boruta$ImpHistory), cex.axis=.8, font = 4)
Compute the tentative and confirmed variables from Boruta.
final.ppmi_boruta<-TentativeRoughFix(ppmi_boruta)
## Warning in TentativeRoughFix(ppmi_boruta): There are no Tentative
## attributes! Returning original object.
print(final.ppmi_boruta)
## Boruta performed 10 iterations in 18.04925 mins.
## 7 attributes confirmed important: Age, IssuerId, RatingAreaId,
## SourceName, StateCode and 2 more;
## No attributes deemed unimportant.
final.ppmi_boruta$finalDecision
## StateCode IssuerId SourceName VersionNum RatingAreaId
## Confirmed Confirmed Confirmed Confirmed Confirmed
## Tobacco Age
## Confirmed Confirmed
## Levels: Tentative Confirmed Rejected
getConfirmedFormula(final.ppmi_boruta)
## IndividualRate ~ StateCode + IssuerId + SourceName + VersionNum +
## RatingAreaId + Tobacco + Age
## <environment: 0x7fe4c5e05c70>
print(final.ppmi_boruta$finalDecision[final.ppmi_boruta$finalDecision %in% c("Confirmed", "Tentative")])
## StateCode IssuerId SourceName VersionNum RatingAreaId
## Confirmed Confirmed Confirmed Confirmed Confirmed
## Tobacco Age
## Confirmed Confirmed
## Levels: Tentative Confirmed Rejected
impBoruta <- final.ppmi_boruta$finalDecision[final.ppmi_boruta$finalDecision %in% c("Confirmed")]; length(impBoruta)
## [1] 7
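Boruta's decision rule is worth spelling out: each real feature's importance is compared against the best score achieved by randomly permuted "shadow" copies of the features, and features that consistently beat shadowMax are confirmed. A minimal Python sketch of one iteration, using the first row of ImpHistory above as assumed inputs:

```python
# Importances from one Boruta iteration (row 1 of ImpHistory, as printed):
importances = {
    "StateCode": 116.4772, "IssuerId": 118.9919, "SourceName": 62.05204,
    "VersionNum": 117.0608, "RatingAreaId": 40.31077, "Tobacco": 206.1937,
    "Age": 394.7237,
}
shadow_max = 2.3047120  # best score any shuffled (shadow) feature achieved

# A feature "hits" when it out-scores the best shadow feature; across
# iterations, enough hits -> Confirmed, too few -> Rejected.
hits = {name: imp > shadow_max for name, imp in importances.items()}
print(all(hits.values()))  # True: every feature beats the shadows here
```

This is why the run above confirms all 7 attributes: even the weakest real feature scores far above the strongest shadow.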
We use a neural network to create the predictive model, training it on the data from 2014-2016.
Convert some variables to numeric.
data_for_boruta$SourceName=as.numeric(data_for_boruta$SourceName)
data_for_boruta$RatingAreaId=as.numeric(data_for_boruta$RatingAreaId)
data_for_boruta$Tobacco=as.numeric(data_for_boruta$Tobacco)
data_for_boruta$Age=as.numeric(data_for_boruta$Age)
data_for_boruta$StateCode=as.numeric(data_for_boruta$StateCode)
str(data_for_boruta)
## 'data.frame': 80000 obs. of 8 variables:
## $ StateCode : num 20 27 19 27 34 27 4 11 38 34 ...
## $ IssuerId : int 11512 28162 23603 28162 61315 86728 51485 36096 31274 87226 ...
## $ SourceName : num 0 2 2 2 0 2 0 2 2 0 ...
## $ VersionNum : int 9 4 7 4 5 4 13 10 10 6 ...
## $ RatingAreaId : num 14 12 3 14 23 12 4 11 1 19 ...
## $ Tobacco : num 1 0 0 0 0 0 0 1 1 1 ...
## $ Age : num 44 22 59 65 53 23 50 36 39 34 ...
## $ IndividualRate: num 321.7 424 46.1 1342.6 30.1 ...
# neural net
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
db_norm<-as.data.frame(lapply(data_for_boruta, normalize))
summary(db_norm)
## StateCode IssuerId SourceName VersionNum
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2632 1st Qu.:0.2185 1st Qu.:0.0000 1st Qu.:0.1304
## Median :0.5263 Median :0.4391 Median :0.0000 Median :0.2609
## Mean :0.5282 Mean :0.4682 Mean :0.3093 Mean :0.2635
## 3rd Qu.:0.7895 3rd Qu.:0.7368 3rd Qu.:1.0000 3rd Qu.:0.3478
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## RatingAreaId Tobacco Age IndividualRate
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.000000
## 1st Qu.:0.04545 1st Qu.:0.0000 1st Qu.:0.4769 1st Qu.:0.006143
## Median :0.10606 Median :0.0000 Median :0.6462 Median :0.059485
## Mean :0.18100 Mean :0.4091 Mean :0.6484 Mean :0.064877
## 3rd Qu.:0.21212 3rd Qu.:1.0000 3rd Qu.:0.8308 3rd Qu.:0.095313
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.000000
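The normalize() helper above is plain min-max scaling: every column is mapped onto [0, 1], which the summary confirms (all Min. 0, all Max. 1). For illustration, a Python equivalent with hypothetical ages:

```python
def normalize(xs):
    """Min-max scale a sequence to [0, 1], as in the R helper above."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Illustrative ages: the youngest maps to 0 and the oldest to 1.
print(normalize([10, 44, 65]))
```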
Prepare the training and testing set for the model.
db_norm=db_norm[sample(nrow(db_norm),10000),]
sub<-sample(nrow(db_norm), floor(nrow(db_norm)*0.75))
train1<-db_norm[sub, ]
test<-db_norm[-sub, ]
final_test<-test
We train the neural network on data from 2014-2016 and make predictions for the new set of plan rates from 2017.
set.seed(1234)
net_model1<-neuralnet(IndividualRate~ StateCode+ IssuerId+ SourceName+ VersionNum+ RatingAreaId+Tobacco +Age , data=train1 , hidden= c(3,2))
model1_pred<-compute(net_model1, final_test[,c(1:7)])
pred_results<-model1_pred$net.result
cor(pred_results, final_test$IndividualRate)
## [,1]
## [1,] 0.6401599288
Our model is very simple, with the few predictors deemed important by Boruta. The correlation between predicted and actual rates on the held-out set is only about 0.64.
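Because the target is continuous, the cor() call above scores the network by the Pearson correlation between predicted and actual rates. For reference, the statistic itself can be sketched in Python:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linear predictions give a correlation of 1 (up to
# floating-point rounding); the model above reaches only ~0.64.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```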
plot(net_model1)
We borrowed the 2017 data from the CMS website and used it as a validation set for our model.
val_2017 <- read.csv("Rate_2017.csv")
colnames(val_2017)
## [1] "BusinessYear"
## [2] "StateCode"
## [3] "IssuerId"
## [4] "SourceName"
## [5] "ImportDate"
## [6] "FederalTIN"
## [7] "RateEffectiveDate"
## [8] "RateExpirationDate"
## [9] "PlanId"
## [10] "RatingAreaId"
## [11] "Tobacco"
## [12] "Age"
## [13] "IndividualRate"
## [14] "IndividualTobaccoRate"
## [15] "Couple"
## [16] "PrimarySubscriberAndOneDependent"
## [17] "PrimarySubscriberAndTwoDependents"
## [18] "PrimarySubscriberAndThreeOrMoreDependents"
## [19] "CoupleAndOneDependent"
## [20] "CoupleAndTwoDependents"
## [21] "CoupleAndThreeOrMoreDependents"
We do not need all the columns here. Let’s take the important ones.
cleaned_val_2017<-subset(val_2017, select=c("RatingAreaId","StateCode","IssuerId","SourceName","Tobacco","Age","IndividualRate"))
cleaned_val_2017$SourceName=as.numeric(cleaned_val_2017$SourceName)
cleaned_val_2017$RatingAreaId=as.numeric(cleaned_val_2017$RatingAreaId)
cleaned_val_2017$Tobacco=as.numeric(cleaned_val_2017$Tobacco)
cleaned_val_2017$Age=as.numeric(cleaned_val_2017$Age)
cleaned_val_2017$StateCode=as.numeric(cleaned_val_2017$StateCode)
normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
cval<-as.data.frame(lapply(cleaned_val_2017[,c(1:7)], normalize))
pred_rate <- compute(net_model1, cval)$net.result
plot(pred_rate, cleaned_val_2017$IndividualRate, xlim=c(0, 12), ylim=c(0, 12)); abline(0,1, col="red", lty=2)
legend("bottomright", c("Pred vs. Actual SQRT", "Pred=Actual Line"), cex=0.8, lty=c(1,2), lwd=c(2,2),col=c("black","red"))
We see that the predictive power of the model is poor: it is unable to predict the individual plan rates for the year 2017.
p_a = read.csv("PlanAttributes.csv", stringsAsFactors = FALSE)
The real test of how good a healthcare plan is can be difficult to assess, but one very crude benchmark is the maximum out of pocket (MOOP). In layman's terms, that's the maximum a subscriber has to pay if the absolute worst happens. In this case, we look at the family MOOP. Say, God forbid, a family slid off the road during a snow storm, several people got hurt, and they needed expensive surgery and rehab: the max out of pocket is the amount the family would have to pay before the insurance covers everything 100%. We want to take a look at that number.
Here’s a quick glimpse and then some data cleaning.
head(p_a$TEHBInnTier1FamilyMOOP, 50)
## [1] "" "" "" "" "" "" "$12,700"
## [8] "$8,000" "$8,000" "$12,700" "$0" "$9,500" "$12,000" "$0"
## [15] "" "" "" "$9,500" "$12,000" "$12,700" "$12,700"
## [22] "$0" "" "" "" "" "" "$12,700"
## [29] "$12,000" "$9,500" "$12,000" "$12,700" "$10,400" "$2,500" "$1,000"
## [36] "" "$9,500" "$0" "$12,700" "$10,400" "$2,500" "$1,000"
## [43] "$12,700" "" "" "" "" "" ""
## [50] ""
p_a$TEHBInnTier1FamilyMOOP<- gsub(',', '', p_a$TEHBInnTier1FamilyMOOP)
p_a$TEHBInnTier1FamilyMOOP<- gsub('\\$', '', p_a$TEHBInnTier1FamilyMOOP)
p_a$moop<- as.numeric(p_a$TEHBInnTier1FamilyMOOP)
## Warning: NAs introduced by coercion
p_a$moop[is.na(p_a$moop)] <- 0
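The cleaning above strips currency formatting before coercing to numbers, with blanks becoming 0. The same transformation sketched in Python (parse_moop is a hypothetical helper name):

```python
import re

def parse_moop(s: str) -> float:
    """Turn a formatted amount like '$12,700' into a number; blanks -> 0,
    mirroring the gsub + as.numeric + NA-replacement steps above."""
    cleaned = re.sub(r"[$,]", "", s)
    return float(cleaned) if cleaned else 0.0

print([parse_moop(v) for v in ["$12,700", "$8,000", "", "$0"]])
# [12700.0, 8000.0, 0.0, 0.0]
```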
ggplot(p_a, aes(x = p_a$moop)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are a lot of plans in there with a zero family MOOP. That's not accurate, so we will stick only to plans that actually have a dollar amount.
We are going to map this to see which states have the worst MOOP on average for a family, using a function that turns state abbreviations into a format that the choropleth package can actually use.
stateFromLower <-function(x) {
#read 52 state codes into local variable [includes DC (Washington D.C. and PR (Puerto Rico)]
st.codes<-data.frame(
state=as.factor(c("AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME",
"MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM",
"NV", "NY", "OH", "OK", "OR", "PA", "PR", "RI", "SC", "SD", "TN",
"TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY")),
full=as.factor(c("alaska","alabama","arkansas","arizona","california","colorado",
"connecticut","district of columbia","delaware","florida","georgia",
"hawaii","iowa","idaho","illinois","indiana","kansas","kentucky",
"louisiana","massachusetts","maryland","maine","michigan","minnesota",
"missouri","mississippi","montana","north carolina","north dakota",
"nebraska","new hampshire","new jersey","new mexico","nevada",
"new york","ohio","oklahoma","oregon","pennsylvania","puerto rico",
"rhode island","south carolina","south dakota","tennessee","texas",
"utah","virginia","vermont","washington","wisconsin",
"west virginia","wyoming"))
)
#create an nx1 data.frame of state codes from source column
st.x<-data.frame(state=x)
#match source codes with codes from 'st.codes' local variable and use to return the full state name
refac.x<-st.codes$full[match(st.x$state,st.codes$state)]
#return the full state names in the same order in which they appeared in the original source
return(refac.x)
}
moop <- subset(p_a, moop > 0)
# map this to see which states have the worst MOOP on average for a family.
df <- aggregate(p_a$moop, list(p_a$StateCode), mean)
df$region<-stateFromLower(df$Group.1)
df$value <- df$x
choro = StateChoropleth$new(df)
choro$title = "Average Max Out of Pocket"
choro$set_num_colors(1)
myPalette <- colorRampPalette(brewer.pal(9, "Reds"))
choro$ggplot_polygon = geom_polygon(aes(fill = value), color = NA)
choro$ggplot_scale = scale_fill_gradientn(name = "MOOP", colours = myPalette(9))
choro$render()
## Warning: Column `region` joining character vector and factor, coercing into
## character vector
## Warning in self$bind(): The following regions were missing and are being
## set to NA: minnesota, california, maryland, colorado, washington, vermont,
## kentucky, massachusetts, connecticut, new york, rhode island, district of
## columbia
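The stateFromLower() helper above is essentially a lookup table from postal abbreviations to the lowercase full names the choropleth expects. A compact Python sketch with a handful of the 52 entries:

```python
# A few abbreviation -> full-name pairs (subset of the 52 used above):
ST_NAMES = {
    "AK": "alaska", "AZ": "arizona", "ID": "idaho",
    "NM": "new mexico", "SC": "south carolina",
}

def state_from_abbr(abbrs):
    """Map postal abbreviations to full names; None for unknown codes,
    analogous to the match() returning NA in the R helper."""
    return [ST_NAMES.get(a) for a in abbrs]

print(state_from_abbr(["ID", "AZ", "NM"]))  # ['idaho', 'arizona', 'new mexico']
```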
Idaho is the worst, followed by Arizona and New Mexico. Things look pretty uniform throughout the rest of the country; however, we want to look at how MOOP has changed over time as well. Our data covers only two years of the ACA, 2014 and 2015, and we would like to see if MOOP has gotten higher.
moop14 <- subset(moop, BusinessYear == "2014")
dim(moop14)
## [1] 11763 177
table(moop14$moop)
##
## 400 500 600 700 800 900 950 1000 1100 1150 1200 1240
## 2 1 1 2 40 3 3 178 5 8 39 1
## 1300 1400 1500 1508 1600 1660 1700 1900 2000 2200 2300 2350
## 29 44 115 16 12 4 5 1 266 25 14 12
## 2400 2500 2600 2700 2800 2900 3000 3100 3200 3300 3400 3500
## 29 126 14 6 65 39 385 2 1 4 20 35
## 3600 3700 3750 3800 3900 4000 4200 4230 4300 4400 4500 4600
## 13 14 1 19 1 302 30 8 12 9 533 6
## 4700 5000 5200 5300 5350 5400 5500 5600 5800 5840 5900 6000
## 5 212 28 7 10 12 19 13 12 3 2 485
## 6250 6300 6400 6500 6600 6750 6800 6900 7000 7050 7200 7300
## 2 3 21 24 2 1 4 2 409 10 26 16
## 7400 7500 7600 7700 7800 7900 8000 8200 8250 8300 8400 8500
## 8 22 27 1 2 4 380 5 8 3 30 23
## 8700 8800 9000 9100 9200 9300 9400 9500 9600 9700 9750 9800
## 3 16 303 4 4 4 6 56 13 3 42 10
## 10000 10160 10200 10300 10338 10360 10400 10500 10600 11000 11200 11400
## 741 4 18 2 4 4 402 83 7 181 2 2
## 11500 11600 12000 12200 12400 12500 12600 12650 12675 12700
## 10 31 530 4 2 332 341 3 2 4253
summary(moop14$moop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 400.000 6000.000 10400.000 9207.298 12700.000 12700.000
moop15 <- subset(moop, BusinessYear == "2015")
dim(moop15)
## [1] 22314 177
table(moop15$moop)
##
## 300 400 500 700 800 850 900 950 1000 1050 1100 1150
## 1 4 1 3 51 2 17 6 469 10 43 8
## 1200 1240 1250 1300 1350 1400 1450 1500 1508 1520 1600 1650
## 100 1 1 46 4 64 3 263 23 2 15 2
## 1660 1700 1750 1800 1900 2000 2100 2200 2300 2400 2500 2600
## 2 10 2 1 1 387 1 40 21 52 143 36
## 2700 2800 2900 2950 3000 3050 3100 3200 3250 3300 3400 3450
## 24 82 174 1 688 1 5 23 3 10 24 1
## 3500 3600 3650 3700 3750 3800 3850 3900 4000 4100 4150 4200
## 62 47 1 47 9 12 2 11 691 6 4 88
## 4250 4300 4400 4500 4600 4800 5000 5100 5200 5300 5350 5400
## 6 12 23 668 9 15 232 5 29 20 2 9
## 5500 5600 5800 5840 5900 6000 6150 6200 6250 6350 6400 6500
## 98 50 25 6 4 611 3 9 4 3 16 33
## 6600 6700 6750 6800 6900 6950 7000 7050 7100 7200 7300 7400
## 17 6 2 7 9 2 835 6 1 50 33 9
## 7450 7500 7600 7650 7700 7800 7850 7900 8000 8100 8150 8200
## 4 15 53 2 1 10 2 10 593 2 1 16
## 8250 8300 8400 8500 8600 8700 8800 8850 8900 9000 9050 9100
## 13 7 39 99 1 5 17 2 1 451 1 6
## 9150 9200 9300 9400 9450 9500 9600 9650 9700 9750 9800 9850
## 1 64 12 27 1 79 31 2 42 229 25 5
## 9900 9950 10000 10050 10100 10200 10300 10338 10360 10400 10500 10600
## 9 3 1053 2 2 46 6 4 4 668 283 19
## 10700 10800 10850 10950 11000 11050 11100 11150 11200 11250 11300 11400
## 3 9 3 8 435 3 6 2 12 2 126 18
## 11450 11500 11550 11600 11700 11750 11800 11900 12000 12050 12200 12400
## 5 46 3 40 2 3 18 6 963 3 7 17
## 12450 12500 12550 12600 12650 12700 12800 12900 13000 13100 13200
## 3 304 3 558 3 4558 108 559 391 9 3390
One thing to note here: there are LOTS more total plans in 2015. Almost twice as many, actually. That means we need to think about how to display this visually to prevent misleading.
summary(moop15$moop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 300.000 6000.000 11175.000 9498.264 12700.000 13200.000
moop16 <- subset(moop, BusinessYear == "2016")
dim(moop16)
## [1] 0 177
Now, this is where things get very interesting. The max MOOP in 2014 was $12,700 and there are many plans with that MOOP. 4253 in total. That’s about a third of all plans at the max MOOP. But then in 2015 things change. The max MOOP goes up to $13,200. And now many plans have higher MOOPs. Now over 9000 plans have family MOOPs of $12,700+. That’s a huge increase. As the MOOP ceiling has gone up, health insurers have moved their MOOP up as well. That’s a worrying trend.
Remembering that the difference between the total count of plans in 2014 and 2015 is large, we use percentages to display the information in a way that makes sense.
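The idea is simply to turn bin counts into shares of the total, so that the 2014 and 2015 histograms are on a common scale despite the different sample sizes (11763 vs 22314 plans). Sketched in Python:

```python
def to_percentages(counts):
    """Convert raw bin counts to percent-of-total, so histograms built
    from samples of different sizes are directly comparable."""
    total = sum(counts)
    return [100 * c / total for c in counts]

# Two hypothetical bin-count vectors with very different totals:
print(to_percentages([1, 3]))    # [25.0, 75.0]
print(to_percentages([10, 30]))  # [25.0, 75.0] -- same shares, same shape
```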
g1<-ggplot(moop14, aes(x = moop14$moop)) +
geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = 500, col= "black", fill= "orange") +
# for 2014
scale_y_continuous(labels=percent) + labs(title = "MOOP in 2014") + labs(x="Family MOOP", y= "Percent of Total Plans") + theme(text=element_text(size=15))
g2 <- ggplot(moop15, aes(x = moop15$moop)) +
geom_histogram(aes(y = (..count..)/sum(..count..)), binwidth = 500, col= "black", fill= "darkgreen") +
# for 2015
scale_y_continuous(labels=percent) + labs(title = "MOOP in 2015") + labs(x="Family MOOP", y= "Percent of Total Plans") + theme(text=element_text(size=15))
grid.arrange(g1, g2, ncol = 2)
While the ACA has obviously been a huge benefit to families who need it, it's a little scary to note that so many plans offer the poorest coverage possible under the ACA. It will be interesting to see what happens over the next five years. Will HHS keep allowing MOOP to rise, or will they push back? If this data is any indication, health insurers will continue to raise MOOP if they are allowed.
What we are looking for is a relationship between liberal states and better health insurance through the exchanges. There are many different ways to quantify "good" health insurance, but we have used two measures: maximum out of pocket costs for a family, and monthly premiums.
After doing a quick exploration we see that this dataset has dental insurance mixed in with health insurance, so we keep only the 2014 plans that are not dental-only.
#drop dental insurance plans
moop_final = subset(p_a, BusinessYear == "2014" & DentalOnlyPlan == "No")
head(moop_final$TEHBInnTier1FamilyMOOP, 50)
## [1] "12700" "8000" "8000" "12700" "0" "9500" "12000" "0"
## [9] "9500" "12000" "12700" "12700" "0" "12700" "12000" "9500"
## [17] "12000" "12700" "10400" "2500" "1000" "9500" "0" "12700"
## [25] "10400" "2500" "1000" "12700" "2500" "12700" "1000" "12000"
## [33] "12700" "12000" "0" "12000" "9500" "0" "9500" "12700"
## [41] "12700" "8000" "12700" "8000" "12700" "12700" "12700" "12700"
## [49] "0" "8000"
#removing '$' and comma signs
moop_final$TEHBInnTier1FamilyMOOP<- gsub('\\$', '', moop_final$TEHBInnTier1FamilyMOOP)
moop_final$TEHBInnTier1FamilyMOOP<- gsub(',', '', moop_final$TEHBInnTier1FamilyMOOP)
moop_final$MOOP <- as.numeric(moop_final$TEHBInnTier1FamilyMOOP)
## Warning: NAs introduced by coercion
moop_final$MOOP[is.na(moop_final$MOOP)] <- 0
head(moop_final$MOOP, 50)
## [1] 12700 8000 8000 12700 0 9500 12000 0 9500 12000 12700
## [12] 12700 0 12700 12000 9500 12000 12700 10400 2500 1000 9500
## [23] 0 12700 10400 2500 1000 12700 2500 12700 1000 12000 12700
## [34] 12000 0 12000 9500 0 9500 12700 12700 8000 12700 8000
## [45] 12700 12700 12700 12700 0 8000
counts <- table(moop_final$MOOP)
#counts
barplot(counts, main="Max Out of Pocket", xlab="Amount (in $)")
The distribution is very bimodal, heavily censored at both the right and the left. We see that the max value is 12700 and that over 4000 plans have that as their max out of pocket. After doing some research we find that $12,700 was the maximum allowed family MOOP for plans available through the ACA in 2014, which makes sense. Now we want to find the average MOOP for each state in the dataset.
moop_agg <- aggregate(moop_final$MOOP, list(moop_final$StateCode), mean)
names(moop_agg) <- c("State", "Moop")
moop_agg
## State Moop
## 1 AK 7895.238095
## 2 AL 7861.702128
## 3 AR 7756.797251
## 4 AZ 7242.530541
## 5 DE 6350.588235
## 6 FL 5706.446414
## 7 GA 7593.617021
## 8 IA 7450.909091
## 9 ID 7509.278351
## 10 IL 8240.188088
## 11 IN 7738.820302
## 12 KS 6594.807692
## 13 LA 6854.646925
## 14 ME 7305.755396
## 15 MI 6297.266515
## 16 MO 7071.164021
## 17 MS 7516.126761
## 18 MT 7775.789474
## 19 NC 7743.648208
## 20 ND 8026.666667
## 21 NE 7861.867089
## 22 NH 6695.348837
## 23 NJ 7566.513761
## 24 NM 8171.707317
## 25 OH 7099.120980
## 26 OK 7777.937337
## 27 PA 7026.809651
## 28 SC 4445.089286
## 29 SD 7007.658537
## 30 TN 6566.953317
## 31 TX 7849.168053
## 32 UT 7293.362832
## 33 VA 7813.603819
## 34 WI 6619.328165
## 35 WV 6195.161290
## 36 WY 7544.761905
Now, we need to find a measure of ideology. Thankfully, Richard Fording has a dataset that contains an ideology score for each state in the United States. The latest scores are for the year 2014, which is why we only used that year in our earlier subsetting. There wasn't a really good way to pull these in using R, so we just created the vector by hand.
Higher values are more liberal and lower values more conservative; South Carolina, with a score of 0, is the most conservative state.
The full data is available here: https://rcfording.wordpress.com/state-ideology-data/ Hand-picking the ideology values from that dataset, we get the following vector.
moop_agg$Ideology <- c(35.44, 19.05, 48.89, 3.02, 76.58, 11.33, 3.12, 34.38, 8.78, 83.17, 10.24, 5.38, 14.02, 67.14, 11.17, 47.6, 26.71, 43.46, 6.65, 26.91, 15.68, 66.01, 54.12, 40.63, 11.85, 8.75, 27.45, 0, 22.85, 10.68, 6.97, 6.31, 51.35, 6.10, 72.81, 5.14)
moop_agg
## State Moop Ideology
## 1 AK 7895.238095 35.44
## 2 AL 7861.702128 19.05
## 3 AR 7756.797251 48.89
## 4 AZ 7242.530541 3.02
## 5 DE 6350.588235 76.58
## 6 FL 5706.446414 11.33
## 7 GA 7593.617021 3.12
## 8 IA 7450.909091 34.38
## 9 ID 7509.278351 8.78
## 10 IL 8240.188088 83.17
## 11 IN 7738.820302 10.24
## 12 KS 6594.807692 5.38
## 13 LA 6854.646925 14.02
## 14 ME 7305.755396 67.14
## 15 MI 6297.266515 11.17
## 16 MO 7071.164021 47.60
## 17 MS 7516.126761 26.71
## 18 MT 7775.789474 43.46
## 19 NC 7743.648208 6.65
## 20 ND 8026.666667 26.91
## 21 NE 7861.867089 15.68
## 22 NH 6695.348837 66.01
## 23 NJ 7566.513761 54.12
## 24 NM 8171.707317 40.63
## 25 OH 7099.120980 11.85
## 26 OK 7777.937337 8.75
## 27 PA 7026.809651 27.45
## 28 SC 4445.089286 0.00
## 29 SD 7007.658537 22.85
## 30 TN 6566.953317 10.68
## 31 TX 7849.168053 6.97
## 32 UT 7293.362832 6.31
## 33 VA 7813.603819 51.35
## 34 WI 6619.328165 6.10
## 35 WV 6195.161290 72.81
## 36 WY 7544.761905 5.14
plot(moop_agg$Moop , moop_agg$Ideology)
ggplot(moop_agg, aes(x=moop_agg$Moop, y=moop_agg$Ideology, fill = moop_agg$State)) +
geom_bar(stat = "identity", width = 20)
## Warning: position_stack requires non-overlapping x intervals
Inspired by https://github.com/apalbright/NewYorker/blob/master/scatter.R
my_theme <- function() {
# Define colors for the chart
palette <- brewer.pal("Greys", n=9)
color.background = palette[2]
color.grid.major = palette[4]
color.panel = palette[3]
color.axis.text = palette[9]
color.axis.title = palette[9]
color.title = palette[9]
# Create basic construction of chart
theme_bw(base_size=9, base_family="Georgia") +
# Set the entire chart region to a light gray color
theme(panel.background=element_rect(fill=color.panel, color=color.background)) +
theme(plot.background=element_rect(fill=color.background, color=color.background)) +
theme(panel.border=element_rect(color=color.background)) +
# Format grid
theme(panel.grid.major=element_line(color=color.grid.major,size=.25)) +
theme(panel.grid.minor=element_blank()) +
theme(axis.ticks=element_blank()) +
# Format legend
theme(legend.position="right") +
theme(legend.background = element_rect(fill=color.panel)) +
theme(legend.text = element_text(size=10,color=color.axis.title)) +
# Format title and axes labels these and tick marks
theme(plot.title=element_text(color=color.title, size=20, vjust=0.5, hjust=0, face="bold")) +
theme(axis.text.x=element_text(size=10,color=color.axis.text)) +
theme(axis.text.y=element_text(size=10,color=color.axis.text)) +
theme(axis.title.x=element_text(size=12,color=color.axis.title, vjust=-1, face="italic")) +
theme(axis.title.y=element_text(size=12,color=color.axis.title, vjust=1.8, face="italic")) +
# Plot margins
theme(plot.margin = unit(c(.5, .5, .5, .5), "cm"))
}
We are going to throw a regression line on the visualization just to give a sense of the relationship.
ggplot(moop_agg, aes(Moop, Ideology))+
my_theme()+
geom_point(shape=1) +
geom_smooth(method=lm)+
labs(title= "", x="Max Out of Pocket for Family", y="State Ideology Scores")+
ggtitle(expression(atop(bold("Do Blue States Fare Better Under the ACA?"), atop(italic("Association between State Liberalism and Max Out of Pocket"),""))))+
geom_text(aes(label=State), vjust=-1, hjust=0.5, size=2)+theme(plot.title = element_text(size = 16, colour = "black", vjust = 0.5, hjust=0.5))
It looks like more liberal states actually have HIGHER overall MOOPs and more conservative states have LOWER MOOPs. However, the relationship isn't statistically significant.
We want to do the same thing for premiums. However, we need to load a new dataset.
df2 <- aggregate(rate_final_2014$IndividualRate, list(rate_final_2014$StateCode), mean)
names(df2) <- c("State", "Rate")
df2
## State Rate
## 1 AK 687.4846512
## 2 AL 310.4216273
## 3 AR 191.7281400
## 4 AZ 358.3430163
## 5 DE 303.5045784
## 6 FL 225.0343449
## 7 GA 236.2480380
## 8 IA 352.0917299
## 9 ID 335.5906603
## 10 IL 393.6916331
## 11 IN 470.4898775
## 12 KS 288.8863991
## 13 LA 388.6851162
## 14 ME 386.0573926
## 15 MI 286.4078646
## 16 MO 157.7844068
## 17 MS 336.7787666
## 18 MT 266.3338661
## 19 NC 343.6952749
## 20 ND 333.7062557
## 21 NE 339.7309144
## 22 NH 309.9268235
## 23 NJ 431.8352962
## 24 NM 262.8701088
## 25 OH 414.9238247
## 26 OK 369.7863552
## 27 PA 349.6579366
## 28 SC 302.8205983
## 29 SD 441.6457359
## 30 TN 323.4958361
## 31 TX 212.6574451
## 32 UT 282.5370221
## 33 VA 414.1727704
## 34 WI 474.7037436
## 35 WV 206.1047558
## 36 WY 451.4464911
Now we will merge this new dataframe with the one we constructed previously.
total <- merge(moop_agg,df2,by=c("State"))
total
## State Moop Ideology Rate
## 1 AK 7895.238095 35.44 687.4846512
## 2 AL 7861.702128 19.05 310.4216273
## 3 AR 7756.797251 48.89 191.7281400
## 4 AZ 7242.530541 3.02 358.3430163
## 5 DE 6350.588235 76.58 303.5045784
## 6 FL 5706.446414 11.33 225.0343449
## 7 GA 7593.617021 3.12 236.2480380
## 8 IA 7450.909091 34.38 352.0917299
## 9 ID 7509.278351 8.78 335.5906603
## 10 IL 8240.188088 83.17 393.6916331
## 11 IN 7738.820302 10.24 470.4898775
## 12 KS 6594.807692 5.38 288.8863991
## 13 LA 6854.646925 14.02 388.6851162
## 14 ME 7305.755396 67.14 386.0573926
## 15 MI 6297.266515 11.17 286.4078646
## 16 MO 7071.164021 47.60 157.7844068
## 17 MS 7516.126761 26.71 336.7787666
## 18 MT 7775.789474 43.46 266.3338661
## 19 NC 7743.648208 6.65 343.6952749
## 20 ND 8026.666667 26.91 333.7062557
## 21 NE 7861.867089 15.68 339.7309144
## 22 NH 6695.348837 66.01 309.9268235
## 23 NJ 7566.513761 54.12 431.8352962
## 24 NM 8171.707317 40.63 262.8701088
## 25 OH 7099.120980 11.85 414.9238247
## 26 OK 7777.937337 8.75 369.7863552
## 27 PA 7026.809651 27.45 349.6579366
## 28 SC 4445.089286 0.00 302.8205983
## 29 SD 7007.658537 22.85 441.6457359
## 30 TN 6566.953317 10.68 323.4958361
## 31 TX 7849.168053 6.97 212.6574451
## 32 UT 7293.362832 6.31 282.5370221
## 33 VA 7813.603819 51.35 414.1727704
## 34 WI 6619.328165 6.10 474.7037436
## 35 WV 6195.161290 72.81 206.1047558
## 36 WY 7544.761905 5.14 451.4464911
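The merge() call above is an inner join on the shared State column; the same logic sketched in Python with hypothetical rows:

```python
def merge_on_state(left, right):
    """Inner-join two lists of dicts on their 'State' key,
    mirroring merge(moop_agg, df2, by = "State")."""
    rates = {row["State"]: row for row in right}
    return [{**l, **rates[l["State"]]} for l in left if l["State"] in rates]

# Hypothetical rows from each table; unmatched states ("ZZ") drop out:
moop_agg = [{"State": "AK", "Moop": 7895.24, "Ideology": 35.44}]
df2 = [{"State": "AK", "Rate": 687.48}, {"State": "ZZ", "Rate": 0.0}]
total = merge_on_state(moop_agg, df2)
print(total)
```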
ggplot(total, aes(x=Rate, y=Ideology))+
my_theme()+
geom_point(shape=1) +
geom_smooth(method=lm)+
labs(title= "", x="Monthly Premium", y="State Ideology Scores")+
ggtitle(expression(atop(bold("Do Blue States Fare Better Under the ACA?"), atop(italic("Association between liberal scores and Health Insurance Premiums"),""))))+
geom_text(aes(label=State), vjust=-1, hjust=0.5, size=2)+
theme(plot.title = element_text(size = 16, face = "bold", colour = "black", vjust = 0.5, hjust=0.5))
Here the relationship is a negative one: more liberal states do have lower monthly health insurance premiums. However, this relationship is not statistically significant, either.
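The significance claim above can be checked directly with a Pearson correlation test. This is a minimal sketch assuming the merged `total` data frame from above is still in scope:

```r
# Test whether the negative association between monthly premium and
# state ideology score is statistically significant.
ct <- cor.test(total$Rate, total$Ideology, method = "pearson")
ct$estimate   # correlation coefficient (negative, matching the plot)
ct$p.value    # a value above 0.05 supports the non-significance finding
```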
We generated some interesting and meaningful insights from the preliminary visualizations and the subsequent analysis.
We found that, on average, the states offering the most expensive insurance plans are Wyoming and Wisconsin. We also discovered a roughly linear relationship between Age and Individual Plan Rate, implying that people in the higher age brackets form the primary revenue market for health insurance providers/carriers.
The association-rule results indicated that the most important attributes across benefit plans are the Domestic Partner as Spouse Indicator and the Same Sex Partner as Spouse Indicator.
Another interesting conclusion was drawn from the MOOP (Maximum Out of Pocket) analysis. Comparing MOOP values across years showed a clear pattern of health insurers raising the MOOP every year where permitted, which is concerning in the long run. This analysis also provided insight into state ideology: we found that states' liberal scores are directly proportional to MOOP and inversely proportional to monthly premiums by state.
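Both claimed proportionalities can be quantified in one step from the merged `total` data frame (State, Moop, Ideology, Rate) built earlier; this is a sketch assuming that object is in scope:

```r
# Liberal score vs. MOOP: expected positive, per the analysis above.
cor(total$Ideology, total$Moop)
# Liberal score vs. monthly premium: expected negative.
cor(total$Ideology, total$Rate)
```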
We also trained a neural network to build a predictive model for the Individual Plan Rate using data from past years. The features (most importantly Age, Plan Validity/Duration, and Tobacco Preference) were selected using the Boruta package. This model can help predict plan rates for future data.
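A minimal Boruta run of the kind described looks like the following; `rates_df` and its column names are illustrative placeholders, not the exact CMS field names used in the project:

```r
library(Boruta)  # all-relevant feature selection via shadow features and random forests

set.seed(42)
bor <- Boruta(IndividualRate ~ Age + PlanDuration + Tobacco + StateCode,
              data = rates_df, maxRuns = 50)
print(bor)                  # decision (Confirmed/Rejected/Tentative) per feature
getSelectedAttributes(bor)  # confirmed features to feed the neural network
```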
We would like to explore more feature selection methods, such as RFE (Recursive Feature Elimination) and stepwise feature selection. These may yield better features and hence even more accurate predictive models.
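As a sketch of the proposed future work, RFE is available in the `caret` package; again, `rates_df` and its columns are hypothetical stand-ins for the project's actual fields:

```r
library(caret)

# Cross-validated RFE with random-forest ranking functions.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
set.seed(42)
rfe_fit <- rfe(x = rates_df[, c("Age", "PlanDuration", "Tobacco")],
               y = rates_df$IndividualRate,
               sizes = 1:3, rfeControl = ctrl)
predictors(rfe_fit)  # the feature subset RFE retains
```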
Also, to overcome the major challenge of the dataset's size (which created computational limitations) and thereby perform better analysis overall, we would like to make use of SparkR as well as GPU computing. This would allow us to handle big data more effectively and generate more accurate and relevant insights.
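SparkR distributes aggregations so the full twelve-million-row rate file need not fit in a single R session's memory. A hedged sketch, with the file name and column names as assumptions rather than the project's actual files:

```r
library(SparkR)

sparkR.session(appName = "ACA-rates", master = "local[*]")
# Read the (assumed) CMS rate file as a distributed Spark DataFrame.
rates <- read.df("Rate.csv", source = "csv",
                 header = "true", inferSchema = "true")
by_state <- summarize(groupBy(rates, rates$StateCode),
                      avg_rate = avg(rates$IndividualRate))
head(arrange(by_state, desc(by_state$avg_rate)))
sparkR.session.stop()
```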
Also, since the dataset was incomplete, covering only 39 states, as part of future work we would like to carry out the analysis on a complete dataset.
In performing this project, we drew on numerous online resources provided by the University of Michigan, which deserve our greatest gratitude. We would especially like to thank Prof. Ivo Dinov, course instructor of Data Science and Predictive Analytics (HS650) at the University of Michigan, Ann Arbor, who introduced us to the methodology of this work, whose passion for the "underlying structures" had a lasting effect on all of us, and who provided sound guidance throughout numerous consultations. Many people, especially our classmates and team members, made valuable comments and suggestions that inspired us to improve this work. We thank everyone who helped, directly or indirectly, to complete this project.
Dinov, ID. (2018) Data Science and Predictive Analytics: Biomedical and Health Applications using R, Springer (ISBN 978-3-319-72346-4).
https://www.cms.gov/cciio/resources/data-resources/marketplace-puf.html#
http://www.socr.umich.edu/people/dinov/courses/DSPA_Topics.html
http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/03_DataVisualization.html
http://www.socr.umich.edu/people/dinov/courses/DSPA_notes/10_ML_NN_SVM_Class.html
https://www.kaggle.com/ryanburge/max-out-of-pocket-in-the-aca-going-up
https://www.statista.com/topics/1530/health-insurance-in-the-us/
https://www.healthcare.gov/glossary/affordable-care-act/